Spark SQL

Spark SQL is a Spark module for structured data processing, built on top of core Apache Spark. It blurs the line between RDDs and relational tables and offers much tighter integration between relational and procedural processing through a declarative DataFrame API that integrates with Spark code, which also opens up more opportunities for optimization. The DataFrame API and the Dataset API are the ways to interact with Spark SQL.

Apache Spark SQL provides the following:
  • DataFrame API: a library for working with data as tables
  • Defining DataFrames: DataFrames contain Rows and Columns
  • Catalyst Optimizer: an extensible optimization framework
  • A SQL engine and command-line interface
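
A small sketch of defining a DataFrame of Rows and Columns through the API (the session name and data below are invented for illustration; later snippets reuse this spark session):

from pyspark.sql import SparkSession, Row

# Hypothetical SparkSession used by the examples that follow
spark = SparkSession.builder.appName("SparkSQLExamples").getOrCreate()

# Each Row is one record; its field names become the DataFrame's columns
rows = [Row(name="Alice", dept="Sales"), Row(name="Bob", dept="Finance")]
df = spark.createDataFrame(rows)
df.printSchema()
df.show()
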
The features of Apache Spark SQL are:
  • Hive Compatibility: Compatible with the existing Hive queries, UDFs, and data.
  • SQL Queries Support: Mixes SQL queries with Spark programs.
  • Components Support: Includes a cost-based optimizer, code generation, and columnar storage.
  • Spark Engine Inclusion: Built on the Spark engine, which allows it to scale to multi-hour queries and thousands of nodes.
  • Loading data from a variety of structured sources.
  • Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau. 
  • Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more, as sketched below.
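
For example, a plain Python function can be registered and then used from SQL alongside regular DataFrame code (a rough sketch; the function name and view name are made up):

from pyspark.sql.types import IntegerType

# Register a plain Python function so SQL queries can call it
spark.udf.register("str_len", lambda s: len(s), IntegerType())

names = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
names.createOrReplaceTempView("people")

# A SQL query and regular DataFrame code working on the same data
spark.sql("SELECT name, str_len(name) AS name_len FROM people").show()
names.filter(names.name != "Bob").show()
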
What are the various data sources available in Spark SQL?
  • Parquet file
  • JSON Datasets
  • Hive tables
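
A brief sketch of loading each of these sources with the read API (the file paths and table name are only placeholders):

# Parquet file (path is a placeholder)
parquet_df = spark.read.parquet("/Data/names.parquet")

# JSON dataset (path is a placeholder)
json_df = spark.read.json("/Data/names.json")

# Hive table (requires a Hive-enabled SparkSession; table name is a placeholder)
hive_df = spark.read.table("default.employees")
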
Spark SQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port that ran on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to supporting various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query.
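
The query below is only an illustration; it assumes a Hive table named employees with deptno and sal columns exists in the metastore:

# A typical HiveQL query that runs unchanged through Spark SQL
spark.sql("""
    SELECT deptno, MAX(sal) AS max_sal
    FROM employees
    GROUP BY deptno
    HAVING MAX(sal) > 3000
""").show()
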

Catalyst Optimizer
The Catalyst Optimizer supports both rule-based and cost-based optimization. In rule-based optimization, the optimizer applies a fixed set of rules to determine how to execute the query. Cost-based optimization, by contrast, finds the most suitable way to carry out the SQL statement: multiple plans are generated using the rules, their costs are computed, and the cheapest plan is chosen.
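
The effect of the optimizer can be inspected with explain(), which prints the logical and physical plans Catalyst produces (a minimal sketch; the DataFrame here is made up):

df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Print the parsed, analyzed, optimized logical plans and the physical plan
df.filter(df.value > 1).select("key").explain(extended=True)
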

Run SQL on files directly
Instead of using the read API to load a file into a DataFrame and then querying it, you can also query the file directly with SQL.

spark.sql(" select * from parquet.`/Data/names.parquet` ").show
spark.sql(" select * from json.`namejson` ").show
spark.sql(" select * from orc.`/Data/nameorc` ").show
 

Working with Spark SQL

data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

# Column names: employee name, job/department, salary
columns = ["ename", "job", "sal"]

# Build a DataFrame and register it as a temporary view so it can be queried with SQL
df = spark.createDataFrame(data=data, schema=columns)
df.createOrReplaceTempView("emp")

# Query the view with plain SQL
df2 = spark.sql("SELECT ename, job, sal FROM emp")
df2.show()
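
Once the temp view exists, other SQL statements work against it as well, for example:

# Further queries against the same emp view
spark.sql("SELECT DISTINCT job FROM emp").show()
spark.sql("SELECT ename, sal FROM emp WHERE sal > 3500").show()
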


